Data Science and Big Data in Energy Forecasting
This editorial summarizes the performance of the special issue entitled Data Science and Big Data in Energy Forecasting, which was published in MDPI’s Energies journal. The special issue took place in 2017 and accepted a total of 13 papers from 7 different countries. Electrical, solar and wind energy forecasting were the most analyzed topics, introducing new methods with applications of utmost relevance. Ministerio de Competitividad TIN2014-55894-C2-R; Ministerio de Competitividad TIN2017-88209-C2-
A Framework for Evaluating Land Use and Land Cover Classification Using Convolutional Neural Networks
Analyzing land use and land cover (LULC) using remote sensing (RS) imagery is essential
for many environmental and social applications. The increase in availability of RS data has led to the
development of new techniques for digital pattern classification. Very recently, deep learning (DL)
models have emerged as a powerful solution to approach many machine learning (ML) problems.
In particular, convolutional neural networks (CNNs) are currently the state of the art for many image
classification tasks. While there exist several promising proposals on the application of CNNs to
LULC classification, the validation framework proposed for the comparison of different methods
could be improved with the use of a standard validation procedure for ML based on cross-validation
and its subsequent statistical analysis. In this paper, we propose a general CNN, with a fixed
architecture and parametrization, to achieve high accuracy on LULC classification over RS data
from different sources such as radar and hyperspectral. We also present a methodology to perform
a rigorous experimental comparison between our proposed DL method and other ML algorithms
such as support vector machines, random forests, and k-nearest-neighbors. The analysis carried out
demonstrates that the CNN outperforms the rest of the techniques, achieving a high level of performance
for all the datasets studied, regardless of their different characteristics. Ministerio de Economía y Competitividad TIN2014-55894-C2-1-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-2-
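The cross-validation procedure mentioned in the abstract can be illustrated with a minimal sketch of k-fold index splitting. The function name `k_fold_indices` and the choice of contiguous, unshuffled folds are assumptions made for illustration; the abstract does not state how the folds were built or which statistical tests followed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds.

    Assumption: no shuffling or stratification; real experiments
    usually shuffle (and often stratify) before splitting.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each classifier under comparison would be trained and scored on the same folds, so that the per-fold scores can be fed into a paired statistical test.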
Tackling Ant Colony Optimization Meta-Heuristic as Search Method in Feature Subset Selection Based on Correlation or Consistency Measures
This paper introduces the use of an ant colony optimization
(ACO) algorithm, called Ant System, as a search method in two well-known
feature subset selection methods based on correlation or consistency
measures such as CFS (Correlation-based Feature Selection) and
CNS (Consistency-based Feature Selection). ACO guides the search using
a heuristic evaluator. Empirical results on twelve real-world classification
problems are reported. Statistical tests have revealed that InfoGain is a
very suitable heuristic for CFS or CNS feature subset selection methods
with ACO acting as the search method. InfoGain is shown to be
a significantly better heuristic over a range of classifiers. The results
achieved by means of ACO-based feature subset selection with the suitable
heuristic evaluator are better for most of the problems compared
with those obtained with CFS or CNS combined with Best First search. MICYT TIN2007-68084-C02-02; MICYT TIN2011-28956-C02-02; Junta de Andalucía P11-TIC-752
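As a rough illustration of how an information-gain heuristic could guide an ant's choice of the next feature, the sketch below computes InfoGain for a discrete feature and plugs it into the standard Ant System transition rule (probability proportional to pheromone^alpha times heuristic^beta). The function names and the `alpha` and `beta` defaults are assumptions; the abstract does not give the authors' parametrization.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """Information gain of a discrete feature with respect to the class."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=1.0):
    """Ant System rule: p_i is proportional to pheromone_i^alpha * heuristic_i^beta."""
    weights = [(t ** alpha) * (h ** beta) for t, h in zip(pheromone, heuristic)]
    total = sum(weights)
    if total == 0:
        # Fall back to a uniform choice when no feature carries any weight.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

In this sketch the `heuristic` list would hold the InfoGain of each candidate feature, while the subset found by each ant would still be scored by CFS or CNS.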
Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach
This paper presents a novel procedure for sequentially applying
two data preparation techniques of a different nature: data cleansing
and feature selection. For the former, we experimented
with a partial removal of outliers via the inter-quartile range, whereas for
the latter we chose relevant attributes with two widespread feature
subset selectors, CFS (Correlation-based Feature Selection) and
CNS (Consistency-based Feature Selection), which are founded on correlation
and consistency measures, respectively. Empirical results are outlined on seven
difficult binary and multi-class data sets, that is, data sets with a test error rate of
at least 10%, measured by accuracy, with C4.5 or 1-nearest neighbour
classifiers without any prior data pre-processing.
Non-parametric statistical tests show that combining the aforementioned
two data preparation strategies, using a correlation measure for
feature selection with the C4.5 algorithm, is significantly better, in terms of
the ROC measure, than applying the data cleansing approach alone.
Last but not least, a weak learner like PART
achieved promising results with the new proposal based on a consistency
measure and is able to compete with the best configuration of C4.5. To
sum up, with the new approach and for the ROC measure, the PART classifier
with a consistency metric behaves slightly better than C4.5 with a
correlation measure. MICYT TIN2007-68084-C02-02; MICYT TIN2011-28956-C02-02; Junta de Andalucía P11-TIC-752
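The inter-quartile range filter mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: fences are computed per class and per attribute with the conventional multiplier `k=1.5`, and every flagged row is dropped, whereas the paper performs a partial removal whose details are not given in the abstract. The function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_by_class(rows, labels, k=1.5):
    """Keep only rows whose attributes all lie inside their own class's fences."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # One (low, high) pair per attribute, computed separately for each class.
    bounds = {
        label: [iqr_bounds(column, k) for column in zip(*members)]
        for label, members in by_class.items()
    }
    return [
        (row, label)
        for row, label in zip(rows, labels)
        if all(low <= v <= high for v, (low, high) in zip(row, bounds[label]))
    ]
```

The cleaned rows would then be passed on to the feature subset selector (CFS or CNS) before training the classifier.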
Deleting or Keeping Outliers for Classifier Training?
This paper introduces two statistical outlier
detection approaches by classes. Experiments on binary and
multi-class classification problems reveal that the partial
removal of outliers significantly improves one or two
performance measures for C4.5 and 1-nearest neighbour
classifiers. Also, a taxonomy of problems according to the
amount of outliers is proposed. MICYT TIN2007-68084-C02-02; MICYT TIN2011-28956-C02-02; Junta de Andalucía P11-TIC-752
Minería de Datos: Conceptos y Tendencias (Data Mining: Concepts and Trends)
Nowadays, data mining (DM) is increasingly capturing the attention of companies. It is still
uncommon to hear phrases such as “we should segment our customers using DM tools”, “DM
will increase customer satisfaction”, or “the competition is using DM to gain market share”.
However, everything suggests that sooner rather than later data mining will be used by society, at least with the
same weight that Statistics currently has. So what is data mining and what benefits does it bring?
How can this technology influence the resolution of the daily problems of companies and society in
general? What technologies are behind data mining? What is the life cycle of a typical
data mining project? This article attempts to clarify these questions through an introduction to data
mining: its definition, examples of problems that can be solved with data mining, the tasks of data
mining, the techniques used and, finally, challenges and trends in data mining.
Improving the Evolutionary Coding for Machine Learning Tasks
The most influential factors in the quality of the solutions
found by an evolutionary algorithm are a correct coding of the
search space and an appropriate evaluation function of the potential
solutions. This paper addresses the coding of the search space for obtaining decision
rules, i.e., the representation of the individuals of
the genetic population. Two new methods for encoding discrete and
continuous attributes are presented. Our “natural coding” uses one
gene per attribute (continuous or discrete), leading to a reduction in
the search space. Genetic operators for this natural coding
are formally described and the reduction of the size of the search
space is analysed for several databases from the UCI machine learning
repository. Comisión Interministerial de Ciencia y Tecnología TIC1143–C03–0
Partitioning-Clustering Techniques Applied to the Electricity Price Time Series
Clustering is used to generate groupings of data from a large dataset, with the intention of representing the behavior of a system as accurately as possible. In this work, clustering is applied to extract useful information from the electricity price time series. Specifically, two clustering techniques, K-means and Expectation Maximization, are used to analyze the price curve, demonstrating that these techniques are effective for splitting the whole year into different groups of days according to their price behavior. Later, this information will be used to predict the price in the short term. The prices exhibited a remarkable resemblance among days within the same season and can be split into two major kinds of clusters: working days and holidays.
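To make the grouping of days concrete, here is a minimal K-means sketch in which each point is one day's price curve (for example, a list of hourly prices). The deterministic initialization from the first `k` days and the function name `kmeans` are assumptions chosen to keep the example reproducible; they are not taken from the paper.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain K-means on equal-length numeric vectors.

    Assumption: centroids start at the first k points, which keeps the
    sketch deterministic; practical runs use random or k-means++ seeding.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assign every day to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no day changed cluster
        assignment = new_assignment
        # Move each centroid to the mean curve of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

With `k=2`, days with high daytime prices and days with flat, low prices would typically fall into separate clusters, which is the kind of working-day versus holiday split the abstract reports.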
Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches
We address the feature subset selection problem for classification tasks. We examine the performance of two hybrid strategies that
directly search on a ranked list of features and compare them with two widely used algorithms, the fast correlation based filter
(FCBF) and sequential forward selection (SFS). The proposed hybrid approaches provide the possibility of efficiently applying any
subset evaluator, with a wrapper model included, to large and high-dimensional domains. The experiments performed show that
our two strategies are competitive and can select a small subset of features without degrading the classification error or the
advantages of the strategies under study.
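The general idea of searching directly on a ranked list can be sketched as a single forward pass: rank the features individually, then walk the ranking and keep a feature only if it improves the subset evaluator. This is an illustrative simplification, not the authors' two strategies; `rank_score`, `subset_score` and `min_gain` are hypothetical names for the individual ranking criterion, the subset evaluator (filter or wrapper) and the acceptance threshold.

```python
def ranked_forward_selection(features, rank_score, subset_score, min_gain=0.0):
    """Single pass over a ranking, keeping features that improve the subset score.

    features:     iterable of feature identifiers
    rank_score:   callable scoring one feature on its own (higher is better)
    subset_score: callable scoring a list of features (higher is better)
    """
    ranking = sorted(features, key=rank_score, reverse=True)
    selected, best = [], float("-inf")
    for feature in ranking:
        candidate = selected + [feature]
        score = subset_score(candidate)
        if score > best + min_gain:
            selected, best = candidate, score
    return selected, best
```

Because the subset evaluator is called once per feature, the cost grows linearly with the number of features, which is what makes this style of search attractive for high-dimensional data.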
Analysis of Measures of Quantitative Association Rules
This paper presents an analysis of the relationships among different
interestingness measures of the quality of association rules, as a first step
towards selecting the best objectives for developing a multi-objective algorithm.
For this purpose, the discovery of association rules is based on
evolutionary techniques. Specifically, a genetic algorithm has been used
to mine quantitative association rules and determine the intervals
on the attributes without discretizing the data beforehand. The algorithm
has been applied to real-world climatological datasets based on Ozone and
Earthquake data. Ministerio de Ciencia y Tecnología TIN2007-68084-C-00; Junta de Andalucía P07-TIC-0261
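As an illustration of what interestingness measures look like for a quantitative rule, the sketch below evaluates a rule whose antecedent and consequent are sets of attribute intervals. Support, confidence and lift are used because they are common measures; the abstract does not list which measures the paper analyses, and the function names are hypothetical.

```python
def covers(record, intervals):
    """True if every attribute in `intervals` falls inside its [low, high] range."""
    return all(low <= record[attr] <= high for attr, (low, high) in intervals.items())


def rule_measures(records, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent."""
    n = len(records)
    n_a = sum(covers(r, antecedent) for r in records)
    n_c = sum(covers(r, consequent) for r in records)
    n_ac = sum(covers(r, antecedent) and covers(r, consequent) for r in records)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}
```

In an evolutionary setting, each individual would encode the interval bounds, and measures such as these would serve as candidate objectives.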